Abstract

Background: Multi-locus sequence typing (MLST) has become the gold standard for population analyses of 
bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the 
population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the 
last decade, researchers and population health specialists have invested substantial effort in building up public 
MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important 
information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even 
clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to 
perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and 
genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. 
However given the wealth of data currently available in MLST databases, it is crucial to maintain backwards 
compatibility with MLST schemes so that new genome analyses can be understood in their proper historical 
context. 

Results: We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, 
using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score 
incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of 
over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele 
assignment. SRST output is compatible with common analysis tools such as eBURST, ClonalFrame or Phyloviz, 
allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also 
be generated for novel alleles. 

Conclusions: SRST is a novel software tool for accurate assignment of sequence types using short read data. 
Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, 
plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, 
BWA and SamTools, and is available from http://srst.sourceforge.net. 

Keywords: MLST, Short read, Illumina, Sequence analysis, Plasmid, Chromosome, Microbiology, Bacteria, Population 
analysis, Outbreak 



Background 

Multi-locus sequence typing (MLST) has become the 
gold standard for the analysis of bacterial populations 
[1,2]. MLST involves PCR amplification and sequencing 
of 5-10 loci of ~500 bp in length, with each sequence 
variant assigned a unique locus variant or allele number. 
Each unique combination of locus variants is assigned a 
sequence type (ST), which is then used to denote a precise 
set of sequences. Public MLST databases are used 
to store and share information linking DNA sequences 
to locus variant numbers and sequence types and are 
available for >85 bacterial species including important 
human pathogens such as Staphylococcus aureus, Haemophilus 
influenzae and Neisseria species (see http://pubmlst.org). 
This format allows quick, simple and direct comparison of 
bacterial populations analysed in different laboratories and 
over time. The databases also 
link individual bacterial isolates to STs, serotypes, 
sources and other meta-information with public health 
utility. 

© 2012 Inouye et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 

Inouye et al. BMC Genomics 2012, 13:338 
http://www.biomedcentral.com/1471-2164/13/338 

Advances in sequencing technology continue to bring 
down the costs of whole-genome sequencing of bac- 
teria - indeed it is already close to the cost of MLST 
[3]. Because bacterial genomes are relatively small in 
size (typically 1-10 Mbp) they can be sequenced in 
multiplex, allowing the high read depths of e.g. a single 
lane of Illumina HiSeq to be distributed among up to 
96 different samples via the use of individually-tagged 
libraries [4-7]. Whole-genome sequencing provides a 
lot more information than MLST and can be used to 
study microevolution in much finer detail and over 
small time scales [3-8]. As an assay, whole genome 
shotgun sequencing can be simpler than MLST, which 
requires multiple independent PCR and sequencing 
reactions to be performed for each sample. This will 
become increasingly true as automation becomes 
widely used in whole-genome library preparation workflows. 
Also, while MLST requires the design and purchase of 
species-specific primer sets, whole genome sequencing 
can be applied to any bacterium, without species-specific 
primers or even prior knowledge of 
species. However, while the advantages of whole- 
genome sequencing over MLST are clear, it is crucial 
that newly sequenced isolates (or populations of iso- 
lates) can be analysed in the context of the vast 
amount of population data currently stored in MLST 
databases. MLST can be used to assess the frequency 
and expansion of particular clones, and most MLST 
databases store not only ST information for each iso- 
late, but also detailed meta-information such as sero- 
type, host or source, date of collection and in some 
cases spatial information (see e.g. MLST-maps, 
http://maps.mlst.net [9]). MLST provides a framework against 
which new isolate collections can be compared and 
interpreted, but can only be utilized if the STs of newly 
sequenced isolates can be accurately derived from 
whole-genome sequencing data. 

The only method currently available to perform MLST 
allele assignment on short read data is web-based, re- 
quiring sequence data to be uploaded to a server and 
compared to public MLST databases [10]. This poses 
problems for data security and confidentiality, is unfeas- 
ible for the large datasets typically generated in high- 
throughput multiplex sequencing projects and excludes 
the use of privately maintained MLST databases. All of 
these issues are likely to be significant barriers for use in 
the majority of research or public health laboratories. 
Furthermore the method depends on de novo assembly 
[10], which limits its sensitivity, particularly for genomes 
sequenced at low read depth. 

Here we present an open-source software tool, SRST, 
to derive STs from Illumina short read sequence data 
using a mapping-based approach to maximise sensitivity. 
SRST can be used together with any public or private 
MLST scheme and generates output files suitable for 
comparative analysis with existing MLST datasets, com- 
patible with standard MLST tools such as eBURST [11], 
ClonalFrame [12] and Phyloviz (http://www.phyloviz.net). 
In this paper, we introduce the SRST approach and 
demonstrate its accuracy with real datasets including 
534 genomes from four species-specific and four plas- 
mid MLST schemes, and discuss the usefulness of the 
method for quality control in high-throughput sequen- 
cing projects and outbreak investigations. 

Implementation 

SRST takes as input (a) locus variant sequences and ST 
profile definitions, retrieved from a public or private MLST 
database (such as http://pubmlst.org); (b) sequences flank- 
ing each locus, retrieved from an appropriate reference 
genome sequence using the supplied script; (c) Illumina 
read data (in fastq format; any number of paired or 
single-end read files can be processed in a single command). 
SRST runs on any Linux-based computer or cluster, or on 
Mac OS X, and requires the installation of 
the free packages BWA and SamTools for alignment 
functions [13,14]. Full instructions are available at 
http://srst.sourceforge.net. 
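To make input (a) concrete, the following minimal sketch loads ST profile definitions into a lookup table keyed by the combination of locus variants. It is an illustration rather than SRST's own code: the function name `load_profiles`, the tab-delimited layout with an `ST` column followed by one column per locus, and the two-locus example values are all assumptions.

```python
import csv
import io


def load_profiles(handle, loci):
    """Map each combination of locus variants to its ST.

    handle: open text file with a header row (assumed tab-delimited,
            with an 'ST' column and one column per locus).
    loci:   locus names, in the order used to build the lookup key.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {tuple(row[locus] for locus in loci): row["ST"] for row in reader}


# Hypothetical two-locus profile table, for illustration only.
example = io.StringIO("ST\tadk\tfumC\n10\t1\t2\n11\t1\t3\n")
profiles = load_profiles(example, ["adk", "fumC"])
```

Keying the table on the tuple of variants means a later ST lookup is a single dictionary access once every locus has been assigned.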

Each readset is mapped to each of the possible locus 
variants v (with flanking sequence) and a score s is cal- 
culated to assess the quality of the match, as follows. 
Consider a single base position i in the mapping of read 
set R to locus variant v, in which n_i reads map to position 
i, which has base v_i in v and majority-rules consensus 
(i.e. most prevalent) base r_i in R. If r_i ≠ v_i we record 
a mismatch and rule out v as a possible locus variant. 
Otherwise, we compute the probability that the base call 
r_i (which matches v_i) is erroneous, by calculating Binomial 
probabilities for the three alternative bases x: 

$$\Pr(X \geq k) = \sum_{j=k}^{n_i} \binom{n_i}{j} p^{j} (1-p)^{n_i - j}$$

where k = the observed read count for base x at position 
i. The probability that a non-consensus nucleotide 
(which does not match v_i) is the best explanation for the 
coverage at position i is the sum of the three probabilities. 
The probability that a sequence other than v is the 
best explanation for the observed coverage across the 
whole locus is the sum of these probabilities across all 
positions i. The final score s reported is the negative log 
of this probability, so that higher scores reflect better 
matches. Note this treats the probabilities at each position as 
independent, which they are not. However, the assumption 
is conservative, and only results in non-trivial overestimation 
of the probability in cases where the score 
will be below a threshold of acceptance in any case. 
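As an illustration of the Binomial term in the scoring scheme above, the upper-tail sum can be evaluated directly. This is a minimal sketch, not SRST's own code: the function name `binomial_upper_tail` is hypothetical, the per-read probability `p` is left as an argument because its value is not stated in the text, and Python 3.8 or later is assumed for `math.comb`.

```python
from math import comb


def binomial_upper_tail(n, k, p):
    """Return Pr(X >= k) for X ~ Binomial(n, p).

    n: number of reads covering the position
    k: observed read count for the alternative base
    p: per-read probability (an assumption; not given in the text)
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the Binomial probability mass from j = k up to j = n.
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))
```

For instance, `binomial_upper_tail(3, 2, 0.5)` adds the probabilities of 2 and of 3 successes in 3 trials, giving 0.5.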

For each readset and locus, the highest scoring variant, 
with zero mismatches and passing a user-settable cut-off 
value, is assigned. If all loci are assigned for a particular 
readset and the combination of variants is a known ST, 
this ST will be returned; if a novel combination of locus 
variants is detected this will be assigned a novel ST. If 
any loci are not confidently assigned, no ST can be 
called. 
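The assignment rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SRST source: the function names, the dictionary-based inputs, the `"novel"` label and the use of `>=` when comparing a score with the cut-off are all hypothetical choices.

```python
def best_variant(scores, mismatches, cutoff=10.0):
    """Pick the highest-scoring variant with zero mismatches that passes the cut-off.

    scores, mismatches: dicts keyed by variant number for a single locus.
    Returns the variant number, or None if no variant qualifies.
    """
    candidates = [v for v in scores if mismatches[v] == 0 and scores[v] >= cutoff]
    return max(candidates, key=scores.get) if candidates else None


def call_st(calls, loci, profiles):
    """Combine per-locus calls into an ST.

    calls:    dict mapping locus name to assigned variant (None if unassigned)
    loci:     locus names in scheme order
    profiles: dict mapping a tuple of variants to a known ST
    """
    # No ST can be called unless every locus was confidently assigned.
    if any(calls.get(locus) is None for locus in loci):
        return None
    combination = tuple(calls[locus] for locus in loci)
    # Unknown combinations of known variants are reported as novel.
    return profiles.get(combination, "novel")
```

Separating the per-locus decision from the ST lookup keeps the cut-off logic in one place, so changing the threshold does not affect how profiles are matched.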

The main output is a file specifying the locus variants 
and STs of all input datasets, suitable for analysis with 
common MLST tools such as eBURST or Phyloviz. 

A log file is also generated, detailing scores and cover- 
age statistics for each locus and readset. Where an exact 
match to a known allele cannot be found, the closest al- 
lele (and number of mismatching bases) is reported and 
the closest ST is determined; these results are flagged so 
as to be distinguished from precise ST assignments. 
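The idea of reporting the closest allele and its number of mismatching bases can be illustrated with a simple comparison of a consensus sequence against each known allele. This sketch is an assumption-laden simplification: SRST derives mismatches from read mappings to each variant, whereas here a single consensus string is compared with equal-length, pre-aligned allele sequences, and the function names are hypothetical.

```python
def count_mismatches(consensus, allele):
    """Count positions at which two equal-length sequences differ."""
    if len(consensus) != len(allele):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(consensus, allele) if a != b)


def closest_allele(consensus, alleles):
    """Return (variant, mismatches) for the allele nearest to the consensus.

    alleles: dict mapping variant number to sequence.
    Ties are broken in favour of the lowest variant number.
    """
    if not alleles:
        raise ValueError("at least one allele is required")
    distances = {v: count_mismatches(consensus, seq) for v, seq in alleles.items()}
    variant = min(distances, key=lambda v: (distances[v], v))
    return variant, distances[variant]
```

A result with zero mismatches corresponds to an exact match; any non-zero count would be the kind of result that is flagged as imprecise.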

Optionally, verbose output can be switched on in 
order to retain full sequence information for novel 
alleles, including the alignment (bam format), pileup 
(pileup format) and consensus sequence (fastq format) 
obtained from mapping to the closest-matching locus 
variant. See Additional file 1 for an example fastq gener- 
ated by SRST. This is intended to facilitate investigation 
of novel alleles, in which case visual inspection of align- 
ments or pileups is recommended. Some MLST data- 
bases may also accept fastq or bam files for submission 
of novel alleles. 

Results and Discussion 

Accuracy in calling sequence types 

To determine a suitable cut-off score and test the accur- 
acy of SRST, we utilized three publicly available datasets 
each representing a different group of bacteria - Strepto- 
coccus pneumoniae [6,15], Staphylococcus aureus [4,16] 
and Salmonella bongori [17,18]. All short read, MLST 
and reference data were downloaded from public data- 
bases (Table 1). We ran SRST on each read set (N = 341 
genomes, 2,387 loci) and examined the sensitivity (call 
rate; i.e. the proportion of loci for which a variant could 
be confidently assigned) and specificity (false positive 
rate; i.e. the proportion of loci with incorrect variant 
calls) obtained using different cut-off scores. As Figure 1 
shows, best results were obtained with a cut-off score 
between 7 and 11. We therefore set the default cut-off 
score for SRST to 10, and use this cut-off for all subsequent 
analyses reported below. In the few cases where 
an allele could not be confidently assigned using a cut-off 
score of 10, the expected allele and ST were still correctly 
identified by SRST as the most likely result. 

Figure 1 Score threshold determination for SRST. Each data 
point indicates the call rate (y-axis) and false positive rate (x-axis) for 
a given cut-off score, labeled in blue. False positive rate = proportion 
of loci with incorrect variant calls, call rate = proportion of loci for 
which a variant could be confidently assigned, total loci = 2,387 (341 
samples, 3 species). 
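The call rate and false positive rate used to choose the cut-off can be computed as in the following minimal sketch. The function name, the tuple layout of `results` and the use of `>=` for passing the cut-off are assumptions made for illustration; both rates are expressed as proportions of all loci, following the definitions given for Figure 1.

```python
def rates_at_cutoff(results, cutoff):
    """Return (call_rate, false_positive_rate) for one cut-off score.

    results: list of (score, called_variant, true_variant) tuples,
             one per locus per read set.
    """
    if not results:
        raise ValueError("no loci to evaluate")
    total = len(results)
    # A locus counts as called when its best score passes the cut-off.
    called = [(c, t) for score, c, t in results if score >= cutoff]
    # A false positive is a called locus whose variant is incorrect.
    wrong = sum(1 for c, t in called if c != t)
    return len(called) / total, wrong / total
```

Evaluating this function over a range of cut-off values yields the points of a curve like the one described for Figure 1.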

Quality control in high-throughput sequencing projects 

The most obvious application of SRST is to assign STs 
to novel isolates whose STs are unknown. However we 
were also interested to use SRST for quality control in 
large-scale sequencing studies of bacterial clones, to 
allow early detection and identification of read sets that 
should be excluded from comparative analysis of the 
clonal group of interest. To demonstrate the utility of 
SRST for this purpose, we used it to analyse a set of 
Shigella sonnei genomes sequenced using paired-end 
multiplex Illumina GAII (Table 1) [19]. Note Shigella 
species are actually sublineages of E. coli [20], hence the 
E. coli MLST scheme [21] is used to investigate Shigella. 
A total of 188 Shigella data sets had sufficient mean 
read depth (>10x) to analyse and 170 (90 %) of these 
matched a known ST with SRST scores >10 for all loci. 
The other 18 samples comprised (i) 15 with >1 locus 
scoring <10 but identified as a known S. sonnei allele 
and (ii) 3 with novel locus variants (each having one 



Table 1 High throughput read sets analyzed in this study 


Species 


N 


Read type 


Accession (reads) 


MLST database 


Reference genome 


Staphylococcus aureus [4] 


67 


37 bp SE 


ERP000070 


http://saureus.mlst.net/ 


NC_002952.2 


Salmonella bongori [17] 


18 


54 bp PE 


ERP000328 


http://mlst.ucc.ie/mlst/ 


NC_01 1900.1 


Streptococcus pneumoniae [6] 


256 


54 bp PE 


ERP000139 


http://spneumoniae.mlst.net/ 


NC_01 1149.1 


Shigella sonnei [19] 


188 


54 bp PE 


ERP000182 


http://mlst.ucc.ie/mlst/ 


NC_000913.2 



PE = paired end sequencing, SE = single end sequencing. 
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locus for which the highest scoring match (scores 20- 
160) differed from a known S. sonnei allele by one mis- 
matching base). Hence all 18 were recognizable as 
single-locus variants of known S. sonnei STs, confirming 
their suitability for inclusion in a S. sonnei study. Con- 
sensus sequences and quality scores for the novel alleles, 
generated using SRSTs verbose option, are given in 
Additional file 1. Of the 170 STs assigned with score 
>10, 166 matched known S. sonnei STs (ST152, ST1502, 
ST1504, ST1505). Four matched those of other, non- 
sonnei, Shigella (S. flexneri, ST245; S. boydii, ST243, 
ST1025), indicating they were erroneously included in 
the S. sonnei isolate collection and should be excluded 
from the intended analysis of S. sonnei. The species sta- 
tus of these four isolates, identified by SRST, was con- 
firmed by serotyping of the original isolates, and 
comparison of the read sets to reference genomes of S. 
flexneri and S. boydii. Hence SRST could successfully de- 
tect and identify outliers for removal. In contrast, allele 
assignment by blastn search of de novo assembled con- 
tigs (assembled using Velvet 1.0.13 and Velvet Optimiser 
2.1.7) succeeded for only 60 % of the Shigella read sets. 
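The quality-control step described above, separating single-locus variants of expected STs from true outliers, can be sketched by counting the loci at which two allelic profiles differ. This is a hypothetical illustration, not part of SRST: the function names are invented, and treating any profile that differs from every expected profile at two or more loci as an outlier is an assumption of the sketch.

```python
def loci_differing(profile_a, profile_b):
    """Count the loci at which two allelic profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def classify_profile(profile, expected_profiles):
    """Label a profile relative to the profiles expected for the clone of interest."""
    if not expected_profiles:
        raise ValueError("at least one expected profile is required")
    nearest = min(loci_differing(profile, e) for e in expected_profiles)
    if nearest == 0:
        return "match"
    if nearest == 1:
        return "single-locus variant"
    # Assumption: two or more differing loci marks a candidate for exclusion.
    return "outlier"
```

Read sets labelled as outliers would then be candidates for follow-up checks such as serotyping or mapping to other reference genomes before they are excluded.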



Plasmid MLST 

There are currently MLST schemes available for four 
types of plasmid - IncI1 [22], IncN [23], IncHI1 [24] 
and IncHI2 [25] (http://pubmlst.org/plasmid/). To test 
the accuracy of SRST for plasmid MLST, we used it to 
detect and assign 5-locus STs to IncI1 plasmids in the S. 
sonnei dataset. Since we do not have traditional plasmid 
MLST sequences available as a control, we mapped the 
reads to the reference sequence for IncI1 ST16 plasmid 
pEK204 (NC_013120) [26] and used the proportion of 
the plasmid covered by each read set as a measure of the 
real presence of IncI1 plasmids in the data (e.g. 90 % 
coverage of the reference plasmid would indicate an 
IncI1 plasmid is present; 10 % coverage of the reference 
plasmid would indicate no IncI1 plasmid is present). As 
Figure 2 shows, there was a strong correlation between 
the number of IncI1 plasmid MLST loci that could be 
assigned with confidence by SRST, and the coverage of 
the IncI1 reference plasmid, indicating that SRST is useful 
for screening for the presence of specific plasmid 
types. A total of eight read sets were assigned the same 
IncI1 ST (16) as pEK204; their coverage of pEK204 ranged 
from 94.2 %-99.0 % (mean 96.9 %), while the highest 
coverage of pEK204 in other read sets was 92.3 % 
(assigned to ST37). This suggests that SRST's assignment 
of plasmid STs is as accurate as that for chromosomal 
STs. 

Figure 2 Use of SRST to detect plasmids. X-axis indicates the 
number of IncI1 plasmid MLST loci with a high-confidence allele 
assignment (score > 10) using SRST; y-axis indicates the proportion of 
the IncI1 ST16 reference plasmid, pEK204, covered by each read set. 
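The proportion of a reference plasmid covered by a read set, used above as the control measure, can be computed from per-position read depths. This is a minimal sketch under stated assumptions: the function name and the `min_depth` threshold of one read are illustrative choices, and the depths are assumed to have been extracted beforehand (for example from a pileup).

```python
def breadth_of_coverage(depths, min_depth=1):
    """Return the fraction of reference positions covered by reads.

    depths:    read depth at each position of the reference sequence
    min_depth: minimum depth for a position to count as covered
               (assumption: a single read suffices)
    """
    if not depths:
        raise ValueError("reference has no positions")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)
```

Under this definition a value near 0.9 would correspond to the "90 % coverage" case described in the text.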

High-confidence IncI1 plasmid STs were assigned to a 
total of 26 S. sonnei. As Figure 3 shows, these represent 
a variety of very distinct IncI1 plasmid types, indicative 
of multiple transfers of divergent IncI1 plasmids into S. 
sonnei. The plasmid STs clustered geographically and, 
within geographic regions, temporally (Figure 3), suggesting 
there have been several, highly localized, transfers 
of distinct IncI1 plasmids into the global S. sonnei 
population. 

Outbreak analysis 

Chromosomal and plasmid MLST can provide useful 
insights into bacterial pathogen outbreaks. SRST allows 
these insights to be rapidly extracted from whole- 
genome shotgun sequencing, which can be performed 
without prior knowledge of the species and with no need 
for PCR with species-specific MLST primers. To illus- 
trate this, we utilised five Illumina short read data sets 
from the outbreak of E. coli O104:H4 causing hemolytic 
uremic syndrome in Germany in 2011 [27,28] (accessions 
SRP000285, SRP008003, SRP008032-36, SRP007327). 
The data were generated at two different sites (BGI, China 
and Broad Institute, US). We used SRST to screen the 
outbreak data using the E. coli MLST database to identify 
the chromosomal ST and all publicly available plasmid 
MLST databases to identify IncI1, IncN, IncHI1 or 
IncHI2 plasmids (see above). Details of the datasets and 
results are provided in Table 2. 

SRST correctly identified the chromosomal ST of all 
five outbreak isolates as E. coli ST678, which matches 
that reported using traditional MLST approaches [29]. 
The closest available finished reference genome se- 
quence to the E. coli outbreak strain, Ec55989, shares 
this ST, and has formed the reference for phylogenetic 
and gene content analyses of the German outbreak in all 
published studies [27-30]. This illustrates the utility of 
SRST to rapidly identify the most suitable reference se- 
quence for whole-genome analysis during an outbreak. 
At the time of the German outbreak, Ec55989 had not 
been entered into the E. coli MLST database and extensive 
read mapping to all available E. coli sequences (approximately 
N = 60 at the time) was required to identify 
the most suitable reference [28]; however, assuming all 
available reference sequences are entered into the relevant 
MLST databases, this could be achieved much 
more quickly and easily in future using SRST. 

Figure 3 IncI1 plasmid STs detected among S. sonnei. Minimum spanning tree of all IncI1 STs present in the IncI1 plasmid MLST database as 
at March 13, 2012, generated using Phyloviz (http://www.phyloviz.net). STs detected among S. sonnei read sets are highlighted; node colour 
indicates the country of origin according to the legend provided (bottom-right); node size indicates the number of isolates according to the inset 
circle legend (top-left). Nodes representing known STs are labeled in black with the ST number (7, 13, 16, 26, 27, 55), unlabeled nodes are novel 
STs (novel combinations of known alleles). Dates of isolation are indicated for larger groups (coloured text). Arrow indicates the position of ST31, 
identified in isolates of E. coli O104:H4 associated with an outbreak in Germany in 2011. 

SRST also correctly identified the presence of an IncI1 
plasmid of ST31 in each of the E. coli O104:H4 outbreak 
isolates (arrow in Figure 3). The outbreak strain's antibiotic 
resistance plasmid was previously confirmed as 
IncI1 ST31 using traditional plasmid MLST (present in 
the IncI1 database as CTX-Il-O104:H4, see 
http://pubmlst.org/plasmid/). As Figure 3 shows, this plasmid 
is quite divergent from any we detected in the S. sonnei 
data. The IncI1 plasmid MLST database shows IncI1 
ST31 plasmids have previously been identified in a variety 
of other E. coli hosts circulating in both humans 
and animals, often containing extended spectrum beta-lactamase 
CTX-M genes similar to that encoded in the 
outbreak isolates' IncI1 plasmids (see IncI1 database at 
http://pubmlst.org/plasmid/). No IncN, IncHI1 or IncHI2 
plasmids were identified by SRST, consistent with published 
reports of the outbreak genomes [27-30]. 

Table 2 Chromosomal and plasmid analysis of E. coli outbreak strains 

Strain   E. coli ST (score)  IncI1 ST (score)  Accession  Reference 
C227-11  ST678 (24660)       ST31 (2320)       SRP000285  [27] 
C236-11  ST678 (27131)       ST31 (2723)       SRP008003  [27] 
11-3677  ST678 (20353)       ST31 (11286)      SRP008034  [27] 
11-3798  ST678 (10400)       ST31 (6848)       SRP008035  [27] 
TY-2482  ST678 (1042)        ST31 (326.5)      SRP007327  [28] 

The SRST ST calls and scores are indicated for the chromosome (E. coli 
scheme) and plasmid (IncI1 scheme). No other plasmids with MLST schemes 
available (IncN, IncHI1, IncHI2) were detected using SRST. 

Other potential applications 

As SRST is database driven it could be used for other se- 
quence typing tasks beyond MLST, provided appropriate 
databases are used as input. For example, it could be 
used to annotate drug resistance genes and alleles. Used 
in conjunction with the recent ribosomal MLST 
database [31], SRST could potentially be used for species 
designation of novel isolates. These applications could 
be useful in outbreak analysis, strain identification, sur- 
veillance, studies of mechanisms and transfer of drug re- 
sistance, and a variety of other public health and 
research applications. 

Conclusions 

SRST uses read mapping to assign sequence types to 
novel bacterial genomic sequence data, which offers sev- 
eral advantages over traditional MLST using PCR and 
Sanger sequencing. SRST is accurate and sensitive at al- 
lele assignment and can identify and retrieve novel allele 
sequences for further investigation. Being mapping- 
based it is more sensitive than assembly-based allele as- 
signment for short read data sets, and can be run locally 
without reliance on web-based services or data uploads. 
SRST can be used in a variety of contexts, including sim- 
ple allele assignment to novel data sets, quality control 
in batch sequencing projects, outbreak investigation, 
plasmid MLST and potentially in any scenario where 
database-driven sequence typing is required. 

Availability and requirements 

Project name: SRST (Short Read Sequence Typing) 
Project home page: http://srst.sourceforge.net/ 
Operating system(s): Linux/Mac 
Requirements: samtools 0.1.8, BWA 0.5.7 (open-source) 
Programming language: Python 2.6.4 
License: BSD 

Any restrictions to use by non-academics: No 
Additional file 



Additional file 1 Example of novel allele output using the verbose 
option. Consensus sequences and quality scores for three novel Shigella 
sonnei alleles identified using SRST. 